Automatic Acquisition of Digitized Newspapers via Internet

Authors

  • Ismael Sanz
  • Rafael Berlanga
  • María José Aramburu
  • Francisco Toledo
Abstract

Following our previous work on modelling a database of newspapers and designing a specially suited retrieval language, we are now developing an application to automatically acquire, summarize and store newspaper documents published in different web resources. This paper describes the current implementation of the acquisition process, which includes the recognition of document types and the abstraction of the recognised document values. The network agents in charge of this process are called gatherers, following the terminology used in successful web retrieval systems such as Harvest. To implement gatherers we have combined a context-free grammar with some web traversal techniques, which are available in most current Prolog systems (e.g. SICStus with the PiLLoW library).
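
As a rough illustration of the idea of pairing a small grammar with a page reader, the following minimal sketch recognises one document type and abstracts its values. It is written in Python rather than Prolog, it is not the authors' implementation, and it reads an in-memory HTML string instead of traversing the web. The grammar (article -> h1 [address] p+), the tag choices, and the names `TokenCollector`, `parse_article` and `gather` are all hypothetical.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Flatten HTML into (tag, text) tokens for the tags the toy grammar knows."""

    TAGS = {"h1", "address", "p"}  # hypothetical choice of structural tags

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._open = None   # tag currently being collected, if any
        self._buf = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in self.TAGS:
            self._open = tag
            self._buf = []

    def handle_data(self, data):
        if self._open is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._open:
            self.tokens.append((tag, "".join(self._buf).strip()))
            self._open = None


def parse_article(tokens):
    """Recognise the rule: article -> h1 [address] p+ .

    Returns the abstracted values as a dict, or None if the token
    sequence does not match the rule.
    """
    pos = 0
    if pos >= len(tokens) or tokens[pos][0] != "h1":
        return None
    headline = tokens[pos][1]
    pos += 1

    author = None  # the byline is optional in this toy grammar
    if pos < len(tokens) and tokens[pos][0] == "address":
        author = tokens[pos][1]
        pos += 1

    body = []
    while pos < len(tokens) and tokens[pos][0] == "p":
        body.append(tokens[pos][1])
        pos += 1

    # At least one paragraph is required and no tokens may be left over.
    if not body or pos != len(tokens):
        return None
    return {"type": "article", "headline": headline, "author": author, "body": body}


def gather(html):
    """Tokenise a page and try to recognise it as an article."""
    collector = TokenCollector()
    collector.feed(html)
    collector.close()
    return parse_article(collector.tokens)
```

In a real gatherer the grammar would cover several document types and the input would come from fetched pages; here a page that does not start with a headline is simply rejected by returning `None`.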

Similar resources

A Framework for Text Processing and Supporting Access to Collections of Digitized Historical Newspapers

Large quantities of historical newspapers are being digitized and OCRd. We describe a framework for processing the OCRd text to identify articles and extract metadata for them. We describe the article schema and provide examples of features that facilitate automatic indexing of them. For this processing, we employ lexical semantics, structural models, and community content. Furthermore, we desc...

Improving Access to Digitized Historical Newspapers with Text Mining, Coordinated Models, and Formative User Interface Design

Most tools for accessing digitized historical newspapers emphasize relatively simple search; but, as increasing numbers of digitized historical newspapers and other historical resources become available, we can consider much richer modes of interaction with these collections. For instance, users might use exploratory search for looking at larger issues and events such as elections and campaigns...

Automated Processing of Digitized Historical Newspapers: Identification of Segments and Genres

Many historical newspapers are being digitized. We aim to support access to them via text analysis of the OCRd content. However, the OCR includes many errors; so extracting meaningful content from it is difficult. A pipeline of processing steps is proposed. Here, we describe the first two steps: segmentation and genre identification. The segmentation procedure based on headings was quite succes...

Automated Processing of Digitized Historical Newspapers beyond the Article Level: Sections and Regular Features

Millions of pages of historical newspapers have been digitized, but in most cases access to them is supported only by basic search services. We are exploring interactive services for these collections which would be useful for supporting access, including automatic categorization of articles. Such categorization is difficult because of the uneven quality of the OCR text, but there are many clu...

PIVAJ: an Article-centered Platform for Digitized Newspapers

PIVAJ is a platform for archived digitized newspapers that emphasizes articles: it extracts them from digitized documents by automated page layout analysis, OCRs them, and indexes their text transcriptions so that users can search for content. Crowdsourcing is used to improve the quality of the indexing, by correcting the transcription and by tagging articles with keywords. The platform has been used t...

Publication date: 1997